From Averaging to Acceleration, There is Only a Step-size
We show that accelerated gradient descent, averaged gradient descent and the
heavy-ball method for non-strongly-convex problems may be reformulated as
constant-parameter second-order difference-equation algorithms, where stability
of the system is equivalent to convergence at rate O(1/n^2), where n is the
number of iterations. We provide a detailed analysis of the eigenvalues of the
corresponding linear dynamical system, showing various oscillatory and
non-oscillatory behaviors, together with a sharp stability result with explicit
constants. We also consider the situation where noisy gradients are available
and extend our general convergence result, which suggests an alternative
algorithm (i.e., with different step-sizes) that exhibits the good aspects of
both averaging and acceleration.
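A minimal NumPy sketch of such a constant-parameter second-order recursion on a non-strongly-convex quadratic is given below; the function name, the toy Hessian and the (gamma, beta) values are illustrative assumptions, not the paper's parametrisation or tuned constants.

```python
import numpy as np

# Illustrative sketch: a constant-parameter second-order (two-step) recursion
# on a quadratic f(x) = 0.5 x^T H x - b^T x. Heavy-ball and Nesterov-style
# acceleration differ only in where the gradient is evaluated; stability of
# the recursion governs whether the iterates converge.

def second_order_method(H, b, x0, gamma, beta, nesterov=False, n_iters=200):
    """x_{n+1} = x_n + beta (x_n - x_{n-1}) - gamma * grad f(y_n),
    with y_n = x_n + beta (x_n - x_{n-1}) if nesterov, else y_n = x_n."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(n_iters):
        momentum = beta * (x - x_prev)
        y = x + momentum if nesterov else x
        x_prev, x = x, x + momentum - gamma * (H @ y - b)
    return x

# Toy non-strongly-convex quadratic: H has a zero eigenvalue.
H = np.diag([1.0, 0.1, 0.0])
b = np.array([1.0, 1.0, 0.0])
x0 = np.array([5.0, -5.0, 2.0])
gamma = 1.0 / np.linalg.eigvalsh(H).max()
x_hb = second_order_method(H, b, x0, gamma, beta=0.9)                  # heavy-ball
x_acc = second_order_method(H, b, x0, gamma, beta=0.9, nesterov=True)  # accelerated
print(x_hb, x_acc)
```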
Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression
We consider the optimization of a quadratic objective function whose
gradients are only accessible through a stochastic oracle that returns the
gradient at any given point plus a zero-mean finite variance random error. We
present the first algorithm that achieves jointly the optimal prediction error
rates for least-squares regression, both in terms of forgetting of initial
conditions in O(1/n^2), and in terms of dependence on the noise and the dimension d
of the problem, as O(d/n). Our new algorithm is based on averaged accelerated
regularized gradient descent, and may also be analyzed through finer
assumptions on initial conditions and the Hessian matrix, leading to
dimension-free quantities that may still be small while the "optimal" terms
above are large. In order to characterize the tightness of these new bounds, we
consider an application to non-parametric regression and use the known lower
bounds on the statistical performance (without computational limits), which
happen to match our bounds obtained from a single pass on the data and thus
show optimality of our algorithm in a wide variety of particular trade-offs
between bias and variance.
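As a rough illustration of the averaged accelerated stochastic-gradient idea, here is a single-pass sketch for least squares; the extrapolation coefficient, constant step size and ridge term are illustrative placeholders and not the step-size and regularisation choices analysed in the paper.

```python
import numpy as np

# Sketch: accelerated (extrapolated) stochastic gradient steps on the
# least-squares objective, with on-the-fly Polyak-Ruppert averaging of the
# iterates and an optional ridge regulariser. Parameters are illustrative.

def averaged_accelerated_sgd_ls(X, y, gamma=0.01, reg=1e-3, momentum=0.9):
    n, d = X.shape
    theta = np.zeros(d)
    theta_prev = np.zeros(d)
    theta_bar = np.zeros(d)                                 # running average
    for t in range(n):                                      # single pass over the data
        eta = theta + momentum * (theta - theta_prev)       # extrapolation step
        grad = (X[t] @ eta - y[t]) * X[t] + reg * eta       # stochastic gradient at eta
        theta_prev, theta = theta, eta - gamma * grad
        theta_bar += (theta - theta_bar) / (t + 1)          # online averaging
    return theta_bar

# Usage on synthetic data.
rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + 0.5 * rng.standard_normal(n)
w_hat = averaged_accelerated_sgd_ls(X, y)
print(np.linalg.norm(w_hat - w_star))
```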
Optimal Rates of Statistical Seriation
Given a matrix, the seriation problem consists in permuting its rows in such a
way that all its columns have the same shape, for example, that they are all
monotone increasing. We propose a statistical approach to this problem where the matrix
of interest is observed with noise and study the corresponding minimax rate of
estimation of the matrices. Specifically, when the columns are either unimodal
or monotone, we show that the least squares estimator is optimal up to
logarithmic factors and adapts to matrices with a certain natural structure.
Finally, we propose a computationally efficient estimator in the monotonic case
and study its performance both theoretically and experimentally. Our work is at
the intersection of shape constrained estimation and recent work that involves
permutation learning, such as graph denoising and ranking.
Comment: v2 corrects an error in Lemma A.1; v3 corrects Appendix F on unimodal
regression, where the bounds now hold with polynomial rather than exponential
probability.
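A hypothetical sketch of the monotone case follows: order the rows by a crude statistic (row sums) and project each column of the re-ordered matrix onto monotone sequences via isotonic regression. This illustrates the shape-constrained projection idea only; it is not the estimator analysed in the paper.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def monotone_seriation_estimate(A_noisy):
    """Order rows by row sums, then project each column of the re-ordered
    matrix onto nondecreasing sequences (column-wise isotonic regression)."""
    n, m = A_noisy.shape
    order = np.argsort(A_noisy.sum(axis=1))       # crude row ordering (assumption)
    A_sorted = A_noisy[order]
    iso = IsotonicRegression(increasing=True)
    idx = np.arange(n)
    A_hat = np.column_stack(
        [iso.fit_transform(idx, A_sorted[:, j]) for j in range(m)]
    )
    return A_hat, order

# Usage: a noisy, row-shuffled observation of a matrix with monotone columns.
rng = np.random.default_rng(0)
n, m = 200, 20
A_true = np.sort(rng.random((n, m)), axis=0)      # columns monotone increasing
A_obs = A_true[rng.permutation(n)] + 0.1 * rng.standard_normal((n, m))
A_hat, _ = monotone_seriation_estimate(A_obs)
print(np.mean((A_hat - A_true) ** 2))             # rough denoising error after re-ordering
```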
Saddle-to-Saddle Dynamics in Diagonal Linear Networks
In this paper we fully describe the trajectory of gradient flow over diagonal
linear networks in the limit of vanishing initialisation. We show that the
limiting flow successively jumps from a saddle of the training loss to another
until reaching the minimum ℓ1-norm solution. This saddle-to-saddle
dynamics translates to an incremental learning process as each saddle
corresponds to the minimiser of the loss constrained to an active set outside
of which the coordinates must be zero. We explicitly characterise the visited
saddles as well as the jumping times through a recursive algorithm reminiscent
of the LARS algorithm used for computing the Lasso path. Our proof leverages a
convenient arc-length time-reparametrisation which makes it possible to keep
track of the heteroclinic transitions between the jumps. Our analysis requires
negligible assumptions on the data, applies to both under- and overparametrised
settings, and covers complex cases where the number of active coordinates is
not monotone along the trajectory. We provide numerical experiments to support
our findings.
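The incremental-learning picture can be reproduced with a small simulation; the sketch below runs plain gradient descent on a diagonal linear network beta = u*u - v*v with a tiny initialisation, and the dimensions, sparsity, initialisation scale and step size are illustrative choices rather than the paper's setting.

```python
import numpy as np

# Sketch: gradient descent on a diagonal linear network beta = u*u - v*v with
# vanishing initialisation alpha. Coordinates of beta tend to activate in
# stages, illustrating the saddle-to-saddle / incremental-learning behaviour.

rng = np.random.default_rng(0)
n, d, alpha, lr, n_steps = 50, 100, 1e-10, 0.01, 10_000
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:3] = [3.0, -2.0, 1.0]                      # sparse target
y = X @ beta_star

u = np.full(d, np.sqrt(alpha))                        # vanishing initialisation
v = np.full(d, np.sqrt(alpha))
for step in range(n_steps):
    beta = u * u - v * v
    g = X.T @ (X @ beta - y) / n                      # gradient w.r.t. beta
    u, v = u - lr * 2 * u * g, v + lr * 2 * v * g     # chain rule through u and v
    if step % 1000 == 0:
        print(f"step {step:5d}  active coordinates: {(np.abs(beta) > 1e-2).sum()}")
```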
On Convergence-Diagnostic based Step Sizes for Stochastic Gradient Descent
Constant step-size Stochastic Gradient Descent exhibits two phases: a
transient phase during which iterates make fast progress towards the optimum,
followed by a stationary phase during which iterates oscillate around the
optimal point. In this paper, we show that efficiently detecting this
transition and appropriately decreasing the step size can lead to fast
convergence rates. We analyse the classical statistical test proposed by Pflug
(1983), based on the inner product between consecutive stochastic gradients.
Even in the simple case where the objective function is quadratic, we show that
this test cannot lead to an adequate convergence diagnostic. We then propose a
novel and simple statistical procedure that accurately detects stationarity,
and we provide experimental results showing state-of-the-art performance on
synthetic and real-world datasets.
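For concreteness, here is a sketch of the Pflug-type diagnostic discussed above (the baseline the paper analyses, not the new procedure it proposes): run constant step-size SGD, accumulate the inner products of consecutive stochastic gradients, and divide the step size once the running sum turns negative. The burn-in length and reduction factor are illustrative assumptions.

```python
import numpy as np

def sgd_with_pflug_diagnostic(X, y, gamma0=0.05, reduction=2.0, burn_in=1000,
                              n_epochs=4, seed=0):
    """Constant step-size SGD on least squares with a Pflug-type restart:
    when the running sum of <g_t, g_{t-1}> turns negative, the iterates are
    deemed stationary, the step size is divided and the statistic is reset."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta, gamma = np.zeros(d), gamma0
    prev_grad, stat, count = None, 0.0, 0
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            grad = (X[i] @ theta - y[i]) * X[i]        # stochastic gradient
            theta -= gamma * grad
            if prev_grad is not None:
                stat += prev_grad @ grad               # Pflug's running statistic
                count += 1
                if count >= burn_in and stat < 0:      # stationarity detected
                    gamma /= reduction
                    stat, count = 0.0, 0
            prev_grad = grad
    return theta, gamma

# Usage on synthetic least squares.
rng = np.random.default_rng(1)
n, d = 5_000, 10
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star + 0.3 * rng.standard_normal(n)
theta_hat, final_gamma = sgd_with_pflug_diagnostic(X, y)
print(np.linalg.norm(theta_hat - theta_star), final_gamma)
```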